Introduction to Triton Programming: From Eager Operators to Block-Based Parallelism

Moving from PyTorch eager mode to Triton requires a change of perspective: instead of treating a tensor as a monolithic object, you treat it as a collection of separate, manageable blocks (tiles).

1. PyTorch Tensors vs. Triton Tensors

It is important to distinguish Triton tensors from PyTorch tensors. A PyTorch tensor is a host-side Python object that wraps shape, dtype, device, strides, and storage metadata. A Triton kernel, by contrast, receives raw data pointers and operates on fixed-size blocks of values loaded from them, which permits much lower-level optimization.
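To make the link between host-side metadata and raw pointers concrete, here is a minimal sketch in plain Python (no PyTorch or Triton required). It assumes a contiguous, row-major layout, and the helper names contiguous_strides and flat_offset are hypothetical, chosen only for this illustration. It shows how a multi-dimensional index plus strides collapses into the single element offset that a kernel would add to a base pointer.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a contiguous tensor of this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_offset(index, strides):
    """Element offset from the base pointer for a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (4, 8)
strides = contiguous_strides(shape)      # (8, 1)
offset = flat_offset((2, 3), strides)    # 2 * 8 + 3 * 1 = 19
print(strides, offset)
```

Inside a kernel there is no tensor object to consult, so values such as these strides have to be passed in explicitly alongside the pointer.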

2. The Eager Mode Bottleneck

In standard eager execution, each operation (for example, an addition followed by a ReLU) requires a separate kernel launch and a round trip through global memory. For memory-bound elementwise operations, this is a major bottleneck on modern GPUs. Triton addresses it by fusing the operations into a single kernel that processes a block of data (for example, 128, 256, or 512 elements) in on-chip memory, so intermediate results never return to global memory.
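The cost of the extra round trip can be estimated with simple counting. The sketch below is plain Python arithmetic, not a benchmark: it assumes float32 inputs, ignores caching and launch overhead, and uses hypothetical function names. It compares the global-memory traffic of an unfused add followed by ReLU with that of a single fused kernel.

```python
def elements_moved_eager(n):
    """Global-memory element transfers for add, then ReLU, as two kernels."""
    add_kernel = 2 * n + n    # read x and y, write the temporary
    relu_kernel = n + n       # read the temporary, write the output
    return add_kernel + relu_kernel


def elements_moved_fused(n):
    """Global-memory element transfers for one fused add + ReLU kernel."""
    return 2 * n + n          # read x and y, write the output


n = 1_000_000
bytes_per_element = 4         # assumption: float32
eager_bytes = elements_moved_eager(n) * bytes_per_element
fused_bytes = elements_moved_fused(n) * bytes_per_element
print(f"eager: {eager_bytes} bytes, fused: {fused_bytes} bytes")
print(f"traffic ratio: {eager_bytes / fused_bytes:.2f}x")
```

Under these assumptions the eager version moves 5n elements and the fused version 3n, and the gap widens as more elementwise operations are chained.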

3. The Block-Based Paradigm

Instead of reasoning at the scalar level, as with CUDA threads, Triton applies SPMD (Single Program, Multiple Data) at the block level. You write one kernel, and Triton launches many instances of it across a grid. Each instance uses program_id to compute which block of memory it owns.
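The launch model can be simulated in plain Python to see how each instance derives its block. This is an illustrative sketch, not Triton code: the loop in launch stands in for the grid, the pid argument stands in for the value returned by tl.program_id, and the mask mirrors the one normally passed to tl.load and tl.store so that the last, partial block does not run past the end of the data. All function names here are hypothetical.

```python
def cdiv(a, b):
    """Ceiling division: number of blocks needed to cover a elements."""
    return (a + b - 1) // b


def add_relu_program(pid, x, y, out, n, block_size):
    """One program instance: handles the block of elements owned by pid."""
    block_start = pid * block_size
    offsets = [block_start + i for i in range(block_size)]
    mask = [off < n for off in offsets]       # guard the ragged last block
    for off, valid in zip(offsets, mask):
        if valid:
            out[off] = max(x[off] + y[off], 0.0)


def launch(x, y, block_size):
    """Run one instance per block; on a GPU these would run in parallel."""
    n = len(x)
    out = [0.0] * n
    for pid in range(cdiv(n, block_size)):
        add_relu_program(pid, x, y, out, n, block_size)
    return out


def block_ranges(n, block_size):
    """Half-open index range [start, end) owned by each pid."""
    return [(pid * block_size, min((pid + 1) * block_size, n))
            for pid in range(cdiv(n, block_size))]


print(block_ranges(10, 4))   # [(0, 4), (4, 8), (8, 10)]
print(launch([1.0, -2.0, 3.0], [1.0, 1.0, -5.0], block_size=2))
```

Note that pid starts at 0 and the ranges are half-open, so adjacent blocks share no element and leave no gap.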

4. Environment Setup

To get started, install Triton in a clean environment (using Conda or venv) to avoid dependency conflicts with an existing CUDA toolkit: pip install triton.
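After installing, a quick sanity check is to confirm that the active interpreter can actually see the package. The sketch below uses only the standard library; the helper name is_installed is hypothetical, and checking for torch as well is an assumption based on the PyTorch workflow this lesson describes. It only tests whether the modules are importable, not whether a GPU or a compatible driver is present.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


for name in ("triton", "torch"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported missing inside a freshly created environment, the usual cause is that pip installed it into a different interpreter than the one running the script.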

QUESTION 1

What is the primary difference between a PyTorch tensor and a Triton tensor within a kernel?

Triton tensors contain Python metadata like strides; PyTorch tensors are raw pointers.

A PyTorch tensor is a host-side object wrapping metadata; a Triton tensor represents blocks of data processed at the compiler level.

There is no difference; they are the same object.

Triton tensors are stored on the CPU, while PyTorch tensors are on the GPU.

QUESTION 2

Why is 'Eager Mode' considered a bottleneck for modern GPU performance?

Because it uses too much CPU memory.

Every operation requires a separate kernel launch and a global memory round-trip.

It cannot handle floating-point numbers.

It lacks support for the Python language.

QUESTION 3

What can happen when Triton is installed in a 'dirty' environment with conflicting CUDA toolkits?

Triton will automatically fix the CUDA path.

It may lead to library version mismatches and kernel compilation errors.

The GPU will run faster due to multiple toolkit options.

Triton does not use CUDA, so there is no conflict.

QUESTION 4

Which option gives the correct mapping from pid to index range for N=1000, BLOCK_SIZE=256?

pid 0: [0, 256); pid 1: [256, 512); pid 2: [512, 768); pid 3: [768, 1000)

pid 0: [0, 1000)

pid 0: [0, 256); pid 1: [257, 512); pid 2: [513, 768); pid 3: [769, 1000)

pid 1: [0, 256); pid 2: [256, 512); pid 3: [512, 768); pid 4: [768, 1000)

QUESTION 5

In block-based parallelism, the unit of work shifts from 'compute one element' to:

'Compute one entire tensor'.

'Compute one block of 128/256/512 elements'.

'Compute one scalar at a time'.

'Let the CPU handle the math'.